Guide to using dplyr

We'll be covering the following functions:

  • filter() (and slice())
  • arrange()
  • select() (and rename())
  • distinct()
  • mutate() (and transmute())
  • summarise()
  • sample_n() and sample_frac()

Installing

You can install dplyr using

In [ ]:
install.packages('dplyr')
In [12]:
# Run it using
library(dplyr)

Example Data

Let's use some flight data for our examples. We'll download the nycflights13 data package:

In [ ]:
install.packages('nycflights13',repos = 'http://cran.us.r-project.org')
In [26]:
library(nycflights13)
summary(flights)
Out[26]:
      year          month             day           dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
                                                 NA's   :8255                 
   dep_delay          arr_time    sched_arr_time   arr_delay       
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
 Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
 NA's   :8255      NA's   :8713                  NA's   :9430      
   carrier              flight       tailnum             origin         
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        
                                                                        
     dest              air_time        distance         hour      
 Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
 Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
 Mode  :character   Median :129.0   Median : 872   Median :13.00  
                    Mean   :150.7   Mean   :1040   Mean   :13.18  
                    3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
                    Max.   :695.0   Max.   :4983   Max.   :23.00  
                    NA's   :9430                                  
     minute        time_hour                  
 Min.   : 0.00   Min.   :2013-01-01 05:00:00  
 1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
 Median :29.00   Median :2013-07-03 10:00:00  
 Mean   :26.23   Mean   :2013-07-03 05:02:36  
 3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
 Max.   :59.00   Max.   :2013-12-31 23:00:00  
                                              
In [27]:
# Notice how large the data frame is:
dim(flights)
Out[27]:
  1. 336776
  2. 19

filter()

filter() allows you to select a subset of rows in a data frame. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame:

For example, we can select all flights on November 3rd that were from American Airlines (AA) with:

In [32]:
head(filter(flights,month==11,day==3,carrier=='AA'))
Out[32]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
12013113538545-7824855-31AA2243N5DWAAJFKMIA14410895452013-11-03 05:00:00
22013113556600-4900905-5AA1175N3CSAALGAMIA1481096602013-11-03 06:00:00
32013113604610-6844855-11AA1103N3KDAALGADFW19213896102013-11-03 06:00:00
42013113624629-5907929-22AA1205N3EJAAEWRMIA14110856292013-11-03 06:00:00
52013113625630-5736805-29AA303N4WJAALGAORD1137336302013-11-03 06:00:00
62013113653655-29259205AA1263N634AAJFKLAS30622486552013-11-03 06:00:00

This is a lot simpler than the normal way to do this with a dataframe:

In [36]:
head(flights[flights$month == 11 & flights$day == 3 & flights$carrier == 'AA', ])
Out[36]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
12013113538545-7824855-31AA2243N5DWAAJFKMIA14410895452013-11-03 05:00:00
22013113556600-4900905-5AA1175N3CSAALGAMIA1481096602013-11-03 06:00:00
32013113604610-6844855-11AA1103N3KDAALGADFW19213896102013-11-03 06:00:00
42013113624629-5907929-22AA1205N3EJAAEWRMIA14110856292013-11-03 06:00:00
52013113625630-5736805-29AA303N4WJAALGAORD1137336302013-11-03 06:00:00
62013113653655-29259205AA1263N634AAJFKLAS30622486552013-11-03 06:00:00

slice()

We can select rows by position using slice()

In [37]:
slice(flights, 1:10)
Out[37]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
1201311517515283081911UA1545N14228EWRIAH22714005152013-01-01 05:00:00
2201311533529485083020UA1714N24211LGAIAH22714165292013-01-01 05:00:00
3201311542540292385033AA1141N619AAJFKMIA16010895402013-01-01 05:00:00
4201311544545-110041022-18B6725N804JBJFKBQN18315765452013-01-01 05:00:00
5201311554600-6812837-25DL461N668DNLGAATL116762602013-01-01 06:00:00
6201311554558-474072812UA1696N39463EWRORD1507195582013-01-01 05:00:00
7201311555600-591385419B6507N516JBEWRFLL1581065602013-01-01 06:00:00
8201311557600-3709723-14EV5708N829ASLGAIAD53229602013-01-01 06:00:00
9201311557600-3838846-8B679N593JBJFKMCO140944602013-01-01 06:00:00
10201311558600-27537458AA301N3ALAALGAORD138733602013-01-01 06:00:00

arrange()

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

In [40]:
head(arrange(flights,year,month,day,air_time))
Out[40]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
120131123022200622342225349EV4276N13903EWRBDL241162202013-01-01 22:00:00
220131113181322-413581416-18EV4106N19554EWRBDL2511613222013-01-01 13:00:00
320131121162110622022212-10EV4404N15912EWRPVD2816021102013-01-01 21:00:00
420131120002000020542110-169E3664N836AYJFKPHL30942002013-01-01 20:00:00
520131120562004522156211244EV4170N12540EWRALB311432042013-01-01 20:00:00
6201311908915-710041033-29US1467N959UWLGAPHL32969152013-01-01 09:00:00

You can add desc() to arrange in descending order:

In [42]:
head(arrange(flights,desc(dep_delay)))
Out[42]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
12013196419001301124215301272HA51N384HAJFKHNL6404983902013-01-09 09:00:00
22013615143219351137160721201127MQ3535N504MQJFKCMH7448319352013-06-15 19:00:00
32013110112116351126123918101109MQ3695N517MQEWRORD11171916352013-01-10 16:00:00
42013920113918451014145722101007AA177N338AAJFKSFO354258618452013-09-20 18:00:00
520137228451600100510441815989MQ3075N665MQJFKCVG965891602013-07-22 16:00:00
620134101100190096013422211931DL2391N959DLJFKTPA13910051902013-04-10 19:00:00

select()

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

In [43]:
head(select(flights,carrier))
Out[43]:
carrier
1UA
2UA
3AA
4B6
5DL
6UA

rename()

You can use rename() to rename columns, note this is not "in-place" you'll need to reassign the renamed data structures.

In [44]:
head(rename(flights,airline_car = carrier))
Out[44]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delayairline_carflighttailnumorigindestair_timedistancehourminutetime_hour
1201311517515283081911UA1545N14228EWRIAH22714005152013-01-01 05:00:00
2201311533529485083020UA1714N24211LGAIAH22714165292013-01-01 05:00:00
3201311542540292385033AA1141N619AAJFKMIA16010895402013-01-01 05:00:00
4201311544545-110041022-18B6725N804JBJFKBQN18315765452013-01-01 05:00:00
5201311554600-6812837-25DL461N668DNLGAATL116762602013-01-01 06:00:00
6201311554558-474072812UA1696N39463EWRORD1507195582013-01-01 05:00:00

distinct()

A common use of select() is to find the values of a set of variables. This is particularly useful in conjunction with the distinct() verb which only returns the unique values in a table.

In [46]:
distinct(select(flights,carrier))
Out[46]:
carrier
1UA
2AA
3B6
4DL
5EV
6MQ
7US
8WN
9VX
10FL
11AS
129E
13F9
14HA
15YV
16OO

mutate()

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

In [50]:
head(mutate(flights, new_col = arr_delay-dep_delay))
Out[50]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hournew_col
1201311517515283081911UA1545N14228EWRIAH22714005152013-01-01 05:00:009
2201311533529485083020UA1714N24211LGAIAH22714165292013-01-01 05:00:0016
3201311542540292385033AA1141N619AAJFKMIA16010895402013-01-01 05:00:0031
4201311544545-110041022-18B6725N804JBJFKBQN18315765452013-01-01 05:00:00-17
5201311554600-6812837-25DL461N668DNLGAATL116762602013-01-01 06:00:00-19
6201311554558-474072812UA1696N39463EWRORD1507195582013-01-01 05:00:0016

transmute()

Use transmute if you only want the new columns:

In [52]:
head(transmute(flights, new_col = arr_delay-dep_delay))
Out[52]:
new_col
19
216
331
4-17
5-19
616

summarise()

You can use summarise() to quickly collapse data frames into single rows using functions that aggregate results. Remember to use na.rm=TRUE to remove NA values.

In [63]:
summarise(flights,avg_air_time=mean(air_time,na.rm=TRUE))
Out[63]:
avg_dep_delay
1150.6865

sample_n() and sample_frac()

You can use sample_n() and sample_frac() to take a random sample of rows: use sample_n() for a fixed number and sample_frac() for a fixed fraction.

In [55]:
sample_n(flights,10)
Out[55]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
1201335833838-5937957-20UA367N405UAEWRBOS402008382013-03-05 08:00:00
22013131804810-6105210448FL346N922ATLGAATL1307628102013-01-31 08:00:00
32013102519551945102201214318EV5306N738EVLGAGSO6346119452013-10-25 19:00:00
4201352422022100622343225350EV4700N15572EWRCLT795292102013-05-24 21:00:00
5201345180218002214021400AA177N323AAJFKSFO37125861802013-04-05 18:00:00
6201344634635-1731755-24UA1142N75410EWRBOS402006352013-04-04 06:00:00
7201382717161720-4192719207MQ3556N502MQLGADTW8050217202013-08-27 17:00:00
8201342611551200-512481315-279E3341N905XJJFKBOS331871202013-04-26 12:00:00
9201392319061908-222102220-10UA1515N16732EWRMIA14610851982013-09-23 19:00:00
102013829591003-411321135-3FL353N992ATLGACAK613971032013-08-02 10:00:00
In [60]:
# .005% of the data
sample_frac(flights,0.00005) # USE replace=TRUE for bootstrap sampling
Out[60]:
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
1201373064964368558550EV4393N12163EWRIND966456432013-07-30 06:00:00
22013419205218501221202136224B6527N651JBEWRMCO14993718502013-04-19 18:00:00
320139187406554584380934B6318N183JBJFKBOS421876552013-09-18 06:00:00
420131116559600-19289271UA665N534UAEWRSFO3572565602013-11-16 06:00:00
52013104559600-1710715-5WN464N917WNEWRMDW104711602013-10-04 06:00:00
6201331115001505-518131822-9UA1178N77525EWRAUS22815041552013-03-11 15:00:00
7201312116531655-219452010-25B6185N703JBJFKSAN337244616552013-01-21 16:00:00
8201388282305831251372B6718N329JBJFKBOS351872352013-08-08 23:00:00
9201310238437001031136101581VX399N626VAJFKLAX3322475702013-10-23 07:00:00
102013424180418004195619506AA353N564AALGAORD1157331802013-04-24 18:00:00
112013101016141620-617411755-14DL2443N336NBJFKBOS3618716202013-10-10 16:00:00
12201392011551200-513031334-31UA255N821UALGAORD1087331202013-09-20 12:00:00
13201382112591300-1152115192EV5148N176PQEWRATL1097461302013-08-21 13:00:00
1420132121202125-522332250-17MQ4660N508MQLGABNA11376421252013-02-01 21:00:00
1520132191548152622181018019UA647N567UAEWRLAS302222715262013-02-19 15:00:00
16201314835841-611071113-6EV4388N13994EWRJAX1318208412013-01-04 08:00:00
17201392834840-6101110074EV4594N22971EWRPIT533198402013-09-02 08:00:00

Conclusion

Hopefully you've seen how dplyr() can save you lots of time and headaches! Remember to use help() or just reference the documentation if you ever need help using the package!